ABSTRACT

The young field of statistical diagnostics has produced an array of
competing statistics for measuring the influence of individual cases. Two of
the most popular measures for linear regression are Cook's (1977) $D_i$ and
Belsley, Kuh and Welsch's (1980) $\mathrm{DFFITS}_i$. Using the likelihood
displacement (Cook and Weisberg, 1982) as a unifying concept, these two
measures are compared.
SIGNIFICANCE AND EXPLANATION


The identification of influential cases seems generally accepted as an
important part of linear regression analysis. Although there are many
diagnostic methods available for this, two specific diagnostic statistics,
$D_i$ as proposed by Cook (1977) and $\mathrm{DFFITS}_i$ as proposed by
Belsley, Kuh and Welsch (1980), appear to be used most frequently since they
are available in many widely distributed regression packages. For further
progress and a deeper understanding of available methodology, larger
perspectives seem necessary. We have found the likelihood displacement to be
particularly well-suited for this study.
THE LIKELIHOOD DISPLACEMENT: A UNIFYING
PRINCIPLE FOR INFLUENCE MEASURES

R. Dennis Cook*, Daniel Pena** and Sanford Weisberg*


1. INTRODUCTION

The identification of influential cases seems generally accepted as an
important part of linear regression analysis. Although there are many
diagnostic methods available for this, two specific diagnostic statistics,
$D_i$ as proposed by Cook (1977) and $\mathrm{DFFITS}_i$ as proposed by
Belsley, Kuh and Welsch (1980), appear to be used most frequently since they
are available in many widely distributed regression packages.

A number of authors, including Atkinson (1981), Belsley, Kuh and Welsch
(1980), Cook and Weisberg (1982), Hoaglin and Welsch (1978) and Welsch (1982),
use special pleading to justify the use of $D_i$ or $\mathrm{DFFITS}_i$,
generally concentrating on isolated characteristics of these statistics.
Although useful, such narrow arguments are not likely to resolve important
differences or even allow bilateral recognition of alternative views. One way
to understand these differences better is to cast both diagnostics into a
common framework so that they can be judged in a larger perspective. Such a
framework is provided by the likelihood displacement (distance) as developed
by Cook and Weisberg (1982, p. 182).

In section 2 we review the likelihood displacement and the central results
for linear regression. In section 3 we show that both $D_i$ and
$\mathrm{DFFITS}_i$ fit conveniently into this framework, and address some of
the specific arguments alluded to above. Section 4 contains our concluding
comments.
2. LIKELIHOOD DISPLACEMENT

Let $\theta$ be a $p \times 1$ parameter vector partitioned as
$\theta^T = (\theta_1^T, \theta_2^T)$ where $\theta_j$ is $p_j \times 1$, and
let $L(\theta; Z) = L(\theta_1, \theta_2; Z)$ denote the log likelihood
function for $\theta$ based on data $Z$. To help with later ideas, Figure 1
illustrates the contours of $L(\theta; Z)$ when $p = 2$. The maximum
likelihood estimate (mle) $\hat\theta^T = (\hat\theta_1^T, \hat\theta_2^T)$ is
indicated in Figure 1 by the point F.

In influence analysis we often wish to compare the full data mle
$\hat\theta$ to the mle
$\hat\theta_{(i)}^T = (\hat\theta_{1(i)}^T, \hat\theta_{2(i)}^T)$ obtained from
the log likelihood $L(\theta; Z_{(i)})$ where the subscript "(i)" means
"without case i". One useful and general method for comparing $\hat\theta$ and
$\hat\theta_{(i)}$ is based on the likelihood displacement

$$\mathrm{LD}_i(\theta) = 2[L(\hat\theta; Z) - L(\hat\theta_{(i)}; Z)] \qquad (1)$$

In Figure 1, this displacement corresponds to computing twice the difference
in the heights of the full data log likelihood at $\hat\theta$ and at
$\hat\theta_{(i)}$. If this difference in heights is large, case i is called
influential since deleting it may cause a substantial change in important
conclusions. The likelihood displacement judges all cases falling on the same
contour of $L$ to be equally influential. If desirable, this displacement can
be transformed to a more familiar scale by comparing it to percentiles of a
chi-squared distribution with $p$ degrees of freedom. This comparison gives
the level of the smallest likelihood region for $\theta$ that contains
$\hat\theta_{(i)}$ (Cox and Hinkley, 1974, Chapter 9).
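As a minimal illustration of definition (1), the sketch below evaluates the likelihood displacement by brute force for a deliberately simple model that is not taken from the paper: a normal mean with variance assumed known and equal to 1, so that $p = 1$. The data values and all function names are hypothetical. For this toy model the displacement reduces to $n(\hat\mu - \hat\mu_{(i)})^2$, and the chi-squared calibration with one degree of freedom can be written with the error function.

```python
import math

def normal_loglik(mu, data):
    """Log likelihood of N(mu, 1) data (variance assumed known, equal to 1)."""
    return sum(-0.5 * (y - mu) ** 2 - 0.5 * math.log(2 * math.pi) for y in data)

def likelihood_displacement(loglik, theta_hat, theta_hat_del):
    """Equation (1): twice the drop in the FULL-data log likelihood."""
    return 2.0 * (loglik(theta_hat) - loglik(theta_hat_del))

def chi2_cdf_1df(x):
    """P(chi-squared with 1 d.f. <= x); calibrates LD when p = 1."""
    return math.erf(math.sqrt(x / 2.0))

data = [0.1, -0.4, 0.3, 0.2, 3.0]   # hypothetical sample; the last case is unusual
n = len(data)
mu_hat = sum(data) / n              # full-data mle

ld = []
for i in range(n):
    rest = data[:i] + data[i + 1:]
    mu_del = sum(rest) / (n - 1)    # mle without case i
    ld.append(likelihood_displacement(lambda m: normal_loglik(m, data),
                                      mu_hat, mu_del))

# Level of the smallest likelihood region for mu that contains each deleted mle.
levels = [chi2_cdf_1df(v) for v in ld]
```

Note that the deleted-case estimate is always scored against the full-data log likelihood; only the estimate, not the likelihood surface, changes with the deletion.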

In many problems, a subset of $\theta$ can be regarded as nuisance parameters
so that only the remaining parameters are of interest. Suppose that
$\theta_1$ is of interest while $\theta_2$ represents the nuisance parameters.
Define the implicit function $g(\theta_1)$, such that, for fixed $\theta_1$,
$L(\theta_1, g(\theta_1); Z)$ is maximized; $g(\theta_1)$ is given as a curved
line in Figure 1. The likelihood displacement for $\theta_1$ ignoring
$\theta_2$ can now be defined as

$$\mathrm{LD}_i(\theta_1 \mid \theta_2) = 2\{L(\hat\theta; Z) - L(\hat\theta_{1(i)}, g(\hat\theta_{1(i)}); Z)\} \qquad (2)$$

In Figure 1, the point P is obtained by moving the point $\hat\theta_{(i)}$
parallel to the $\theta_2$ axis until it reaches the curve $g$. Then
$\mathrm{LD}_i(\theta_1 \mid \theta_2)$ is just twice the difference in height
of the point F and the point P. Again, $\mathrm{LD}_i(\theta_1 \mid \theta_2)$
may be calibrated by comparison to the percentiles of a chi-squared
distribution, now with $p_1$ degrees of freedom.

It is fairly straightforward to apply the general results (1) and (2) to
the standard linear regression model

$$Y = X\beta + \varepsilon \qquad (3)$$

where $Y = (y_i)$ is an $n \times 1$ vector of observable responses, the
$n \times p$ matrix $X$ is known and has full rank, $\beta$ is a $p \times 1$
vector of unknown parameters and the $n \times 1$ vector of unobservable
errors $\varepsilon$ is at least tentatively assumed to follow a multivariate
normal distribution with mean 0 and variance $\sigma^2 I$. Let $\hat\beta$ and
$\hat\sigma^2$ denote the maximum likelihood estimators of $\beta$ and
$\sigma^2$, respectively, and let $H = X(X^T X)^{-1} X^T$ so that the fitted
values $\hat Y$ and the residuals $e$ can be written $\hat Y = HY$ and
$e = (I - H)Y$. The diagonal elements of $H$ will be denoted by $h_i$.

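Cook and Weisberg (1982) show that

$$\mathrm{LD}_i(\beta \mid \sigma^2) = n \log\left[\frac{p}{n-p}\, D_i + 1\right] \qquad (4)$$

where $D_i$ is the statistic proposed by Cook (1977):

$$D_i = \frac{(\hat\beta - \hat\beta_{(i)})^T X^T X (\hat\beta - \hat\beta_{(i)})}{p s^2} = \frac{\|\hat Y - \hat Y_{(i)}\|^2}{p s^2} = \frac{r_i^2}{p}\,\frac{h_i}{1-h_i} \qquad (5)$$

where $s^2 = e^T e/(n-p)$, and $r_i = e_i/[s(1-h_i)^{1/2}]$ is the i-th
internally Studentized residual. Since $\mathrm{LD}_i(\beta \mid \sigma^2)$ is
a monotonic function of $D_i$, it is equivalent to $D_i$ for the purpose of
ordering cases based on influence. When $\sigma^2$ is known,
$\mathrm{LD}_i(\beta)$ is equal to $D_i$ with $p s^2$ replaced by $\sigma^2$.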


3. $\mathrm{LD}_i$ AND $\mathrm{DFFITS}_i$


All of the statistics considered here depend on the leverages $h_i$ and the
residuals $e_i$. For later convenience, define

$$b_i = \frac{e_i^2}{e^T e\,(1-h_i)}, \qquad i = 1, 2, \ldots, n \qquad (6)$$

Under model (3) $b_i$ has a beta distribution with parameters 1/2 and
$(n-p-1)/2$. Using (4), (5) and (6) it is immediate that

$$D_i = \frac{n-p}{p}\, b_i\, \frac{h_i}{1-h_i}$$

and thus

$$\mathrm{LD}_i(\beta \mid \sigma^2) = n \log\left\{\frac{b_i h_i}{1-h_i} + 1\right\} \qquad (7)$$
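The relation between $b_i$, $D_i$ and $\mathrm{LD}_i(\beta \mid \sigma^2)$ can be checked numerically. The following is a minimal sketch, assuming simple regression through the origin ($p = 1$) so that no matrix library is needed; the data are hypothetical and the variable names are illustrative only. It computes $D_i$ from $b_i$ and $h_i$ and also directly from definition (5) by refitting without case i.

```python
import math

# Hypothetical data for regression through the origin (p = 1).
x = [1.0, 2.0, 3.0, 4.0, 8.0]
y = [1.1, 1.9, 3.2, 3.7, 9.5]
n, p = len(x), 1

sxx = sum(xi * xi for xi in x)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
e = [yi - beta_hat * xi for xi, yi in zip(x, y)]    # residuals e_i
h = [xi * xi / sxx for xi in x]                     # leverages h_i
rss = sum(ei * ei for ei in e)                      # e'e
s2 = rss / (n - p)

# Equation (6), then D_i written in terms of b_i and h_i.
b = [ei * ei / (rss * (1.0 - hi)) for ei, hi in zip(e, h)]
cook_d = [(n - p) / p * bi * hi / (1.0 - hi) for bi, hi in zip(b, h)]

# The same D_i from definition (5), refitting without case i.
cook_d_direct = []
for i in range(n):
    sxx_i = sxx - x[i] ** 2
    beta_i = (beta_hat * sxx - x[i] * y[i]) / sxx_i
    cook_d_direct.append((beta_hat - beta_i) ** 2 * sxx / (p * s2))

# Equation (7): likelihood displacement for beta ignoring sigma^2.
ld_beta = [n * math.log(bi * hi / (1.0 - hi) + 1.0) for bi, hi in zip(b, h)]
```

Because (7) is monotonic in $D_i$, sorting cases by `ld_beta` or by `cook_d` gives the same ordering.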


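We now turn to the statistic $\mathrm{DFFITS}_i$ which is defined as (Belsley,
Kuh and Welsch 1980)

$$\mathrm{DFFITS}_i = \frac{x_i^T(\hat\beta - \hat\beta_{(i)})}{s_{(i)}\, h_i^{1/2}}$$

where $x_i^T$ is the i-th row of $X$ and
$s_{(i)}^2 = (n-1)\hat\sigma_{(i)}^2/(n-p-1)$ is the usual estimate of
$\sigma^2$ computed without case i. Using the relationship (Cook and Weisberg,
1982, eq. (2.2.8))

$$\frac{\hat\sigma_{(i)}^2}{\hat\sigma^2} = \frac{n}{n-1}\,(1-b_i) \qquad (8)$$

it follows easily that $\mathrm{DFFITS}_i^2$ can be expressed in the form

$$\mathrm{DFFITS}_i^2 = (n-p-1)\,\frac{b_i}{1-b_i}\,\frac{h_i}{1-h_i} \qquad (9)$$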
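A short numerical check of (8) and (9) is sketched below, again assuming regression through the origin with hypothetical data. It uses the standard Belsley, Kuh and Welsch definition, the change in the i-th fitted value divided by $s_{(i)} h_i^{1/2}$, with $s_{(i)}^2$ the deleted residual sum of squares over $n-p-1$; the variable names are illustrative.

```python
import math

# Hypothetical data for regression through the origin (p = 1).
x = [1.0, 2.0, 3.0, 4.0, 8.0]
y = [1.1, 1.9, 3.2, 3.7, 9.5]
n, p = len(x), 1

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
beta_hat = sxy / sxx
rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))

dffits2_direct, dffits2_formula, sigma2_ratio = [], [], []
for i in range(n):
    h_i = x[i] ** 2 / sxx
    e_i = y[i] - beta_hat * x[i]
    b_i = e_i ** 2 / (rss * (1.0 - h_i))

    # Refit without case i.
    beta_i = (sxy - x[i] * y[i]) / (sxx - x[i] ** 2)
    rss_i = sum((yj - beta_i * xj) ** 2
                for j, (xj, yj) in enumerate(zip(x, y)) if j != i)
    s2_i = rss_i / (n - p - 1)

    # Definition: change in the i-th fitted value, scaled by s_(i) * sqrt(h_i).
    dffits = x[i] * (beta_hat - beta_i) / math.sqrt(s2_i * h_i)
    dffits2_direct.append(dffits ** 2)

    # Equation (9): the same quantity from b_i and h_i alone.
    dffits2_formula.append((n - p - 1) * b_i / (1.0 - b_i) * h_i / (1.0 - h_i))

    # Equation (8): ratio of deleted to full-data MLEs of sigma^2, both ways.
    sigma2_ratio.append(((rss_i / (n - 1)) / (rss / n),
                         n / (n - 1) * (1.0 - b_i)))
```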


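We shall also require expressions for $\mathrm{LD}_i(\beta, \sigma^2)$ and
$\mathrm{LD}_i(\sigma^2 \mid \beta)$; these are derived in the Appendix to be

$$\mathrm{LD}_i(\beta, \sigma^2) = n \log\left(\frac{n}{n-1}\right) + n \log(1-b_i) + \frac{b_i(n-1)}{(1-b_i)(1-h_i)} - 1 \qquad (10)$$

$$\mathrm{LD}_i(\sigma^2 \mid \beta) = n \log\left(\frac{n}{n-1}\right) + n \log(1-b_i) + \frac{n b_i - 1}{1-b_i} \qquad (11)$$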


Equation (11) depends only on $b_i$ and not on the leverage $h_i$. Since
$b_i$ is a monotonic transformation of the usual test statistic for a mean
shift outlier, the study of the likelihood displacement for $\sigma^2$
ignoring $\beta$ is equivalent to the study of mean shift outliers.

The full likelihood displacement $\mathrm{LD}_i(\beta, \sigma^2)$ is
monotonically increasing in $h_i$, as is clear from an inspection of (10). In
general, $h_i \ge 0$ and, for models with a constant, $h_i \ge n^{-1}$. A
sufficient condition for (10) to be monotonic in $b_i$ is $h_i \ge n^{-1}$.
Interestingly, $\mathrm{LD}_i(\beta, \sigma^2)$ reduces to
$\mathrm{LD}_i(\sigma^2 \mid \beta)$ when $h_i$ is replaced with its minimum
value $h_i = 0$. In other words, when $h_i = 0$,
$\mathrm{LD}_i(\beta \mid \sigma^2) = 0$ and
$\mathrm{LD}_i(\beta, \sigma^2) = \mathrm{LD}_i(\sigma^2 \mid \beta)$.
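The properties just stated can be explored numerically. The sketch below codes the closed forms (7), (10) and (11) as functions of $b_i$, $h_i$ and $n$, and checks them on a grid; the sample size $n = 20$, the grid and the function names are arbitrary choices made for illustration.

```python
import math

def ld_full(b, h, n):
    """LD_i(beta, sigma^2), equation (10)."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + b * (n - 1) / ((1.0 - b) * (1.0 - h)) - 1.0)

def ld_sigma2(b, n):
    """LD_i(sigma^2 | beta), equation (11); free of the leverage h_i."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + (n * b - 1.0) / (1.0 - b))

def ld_beta(b, h, n):
    """LD_i(beta | sigma^2), equation (7)."""
    return n * math.log(b * h / (1.0 - h) + 1.0)

n = 20                                      # hypothetical sample size
b_grid = [0.05 * k for k in range(1, 19)]   # 0.05, ..., 0.90
h_grid = [0.05 * k for k in range(0, 19)]   # 0.00, ..., 0.90

# At h = 0 the full displacement collapses to the scale-only displacement.
gap_at_zero = max(abs(ld_full(b, 0.0, n) - ld_sigma2(b, n)) for b in b_grid)

# (10) is increasing in h for every fixed b.
increasing_in_h = all(ld_full(b, h2, n) > ld_full(b, h1, n)
                      for b in b_grid for h1, h2 in zip(h_grid, h_grid[1:]))

# (10) is increasing in b whenever h >= 1/n, the sufficient condition above.
increasing_in_b = all(ld_full(b2, h, n) > ld_full(b1, h, n)
                      for h in h_grid if h >= 1.0 / n
                      for b1, b2 in zip(b_grid, b_grid[1:]))
```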

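We now relate $\mathrm{DFFITS}_i$ to the likelihood displacement by
subtracting $\mathrm{LD}_i(\sigma^2 \mid \beta)$ from
$\mathrm{LD}_i(\beta, \sigma^2)$,

$$\mathrm{LD}_i(\beta, \sigma^2) - \mathrm{LD}_i(\sigma^2 \mid \beta) = (n-1)\,\frac{b_i}{1-b_i}\,\frac{h_i}{1-h_i} \qquad (12)$$

Comparing (12) and (9) we see that

$$\mathrm{LD}_i(\beta, \sigma^2) - \mathrm{LD}_i(\sigma^2 \mid \beta) = \frac{n-1}{n-p-1}\,\mathrm{DFFITS}_i^2 \qquad (13)$$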
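For readers who want the intermediate algebra, the subtraction can be written out step by step from (10), (11) and (9). This is only a worked restatement of those three equations, not an additional result:

```latex
\begin{align*}
\mathrm{LD}_i(\beta,\sigma^2) - \mathrm{LD}_i(\sigma^2\mid\beta)
  &= \frac{b_i(n-1)}{(1-b_i)(1-h_i)} - 1 - \frac{n b_i - 1}{1-b_i} \\
  &= \frac{b_i(n-1)}{(1-b_i)(1-h_i)} - \frac{b_i(n-1)}{1-b_i}
     && \text{since } 1 + \frac{n b_i - 1}{1-b_i} = \frac{b_i(n-1)}{1-b_i} \\
  &= \frac{b_i(n-1)}{1-b_i}\left[\frac{1}{1-h_i} - 1\right]
   = (n-1)\,\frac{b_i}{1-b_i}\,\frac{h_i}{1-h_i} \\
  &= \frac{n-1}{n-p-1}\,\mathrm{DFFITS}_i^2
     && \text{by (9).}
\end{align*}
```

The two logarithmic terms are common to (10) and (11) and cancel in the first line.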


The factor $(n-1)/(n-p-1)$ appears in this fundamental relationship since the




$$g_2(\sigma^2) = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \qquad (17)$$

We see that $g_2(\sigma^2)$ does not depend on $\sigma^2$.

As a special case of this problem, we take $n = 4$ and $(x_i, y_i) = (0, 0)$,
$(.2, .2)$, $(.2, -.2)$, $(\sqrt{.92}, \sqrt{.92})$. For these data
$\|X\| = \|Y\| = 1$, and all points but the third fall on a common line. The
all-but-one-point-on-a-line problem is mentioned by Dempster and Green (1981),
and promoted by Welsch (1982) as a reason for the use of $\mathrm{DFFITS}_i$
over $D_i$. The general idea is that $\mathrm{DFFITS}_i$ will always find the
point that lies off the line to be most influential since
$\hat\sigma_{(i)}^2 = 0$, while $D_i$ may identify a point on the line as most
influential, a circumstance that is evidently counter to Welsch's (1982)
intuition. Although this example is relatively simple, its essential
characteristics are perfectly general.

Table 1 lists the maximum likelihood estimates $(\hat\beta, \hat\sigma^2)$ and
$(\hat\beta_{(i)}, \hat\sigma_{(i)}^2)$, $i = 1, 2, 3, 4$. Figure 2 gives a
contour plot of $L(\beta, \sigma^2; Z)$ as defined in (14). In addition,
$g_1(\beta)$, equation (16), is indicated by the short dashes, and
$g_2(\sigma^2)$, equation (17), is indicated by the long dashes. The peak of
$L(\beta, \sigma^2)$ is indicated by "F" and has value given by (15). The
points $(\hat\beta_{(i)}, \hat\sigma_{(i)}^2)$ are marked by $i = 1, 2, 3, 4$.
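The estimates in Table 1 can be recomputed directly from the four data points. The sketch below assumes the regression-through-the-origin model $y_i = \beta x_i + \varepsilon_i$ named in the table caption, with the maximum likelihood variance estimate (divisor equal to the number of cases used); the function name `mle` is illustrative. The printed values should match Table 1 to roughly three decimal places.

```python
import math

# The four cases (x_i, y_i) of the example; model y = beta * x + error.
r = math.sqrt(0.92)
cases = [(0.0, 0.0), (0.2, 0.2), (0.2, -0.2), (r, r)]

def mle(points):
    """Return (beta_hat, sigma2_hat) for regression through the origin."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    beta = sxy / sxx
    rss = sum((y - beta * x) ** 2 for x, y in points)
    return beta, rss / len(points)      # the MLE divides by n, not n - p

full = mle(cases)
deleted = [mle(cases[:i] + cases[i + 1:]) for i in range(len(cases))]

print("F    beta = %.3f  sigma2 = %.4f" % full)
for i, (beta_i, sigma2_i) in enumerate(deleted, start=1):
    print("%d    beta = %.3f  sigma2 = %.4f" % (i, beta_i, sigma2_i))
```

Deleting case 3 leaves three points on the line $y = x$, so the deleted variance estimate is zero; this is the feature that drives $\mathrm{DFFITS}_i$ in the all-but-one-point-on-a-line problem.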

The four influence measures given in (7), (9), (10) and (11) correspond to
the differences in heights between various points in Figure 2. Consider case
4, for example. The full likelihood displacement
$\mathrm{LD}_i(\beta, \sigma^2)$ is simply twice the difference in the heights
of the points located at "F" and "4". For the measure
$\mathrm{LD}_i(\beta \mid \sigma^2)$, the point "4" is moved parallel to the
ordinate until it falls on the curve $g_1(\beta)$; the final position is
indicated by "4A" in Figure 2. Now $\mathrm{LD}_i(\beta \mid \sigma^2)$ is just
twice the difference in the heights of the points at "F" and "4A". Similarly,
$\mathrm{LD}_i(\sigma^2 \mid \beta)$ is obtained by using the heights at "F"
and "4B".


Table 1

Maximum likelihood estimates for simple regression through the origin

Index   Case Deleted                    $\hat\beta$   $\hat\sigma^2$

F       none                            .920          .0382
1       (0, 0)                          .920          .0509
2       (.2, .2)                        .917          .0511
3       (.2, -.2)                       1             0
4       ($\sqrt{.92}$, $\sqrt{.92}$)    0             .0266

Each of the measures $\mathrm{LD}_i(\beta, \sigma^2)$,
$\mathrm{LD}_i(\beta \mid \sigma^2)$ and $\mathrm{LD}_i(\sigma^2 \mid \beta)$
uses the maximum of $L$ as a reference for assessing influence. In contrast,
$\mathrm{DFFITS}_i^2$ assesses influence by using the heights of points "4"
and "4B", both of which lie on the side of $L$. If $\mathrm{DFFITS}_i^2$ is
useful then surely the analogous measure obtained by using point "4" and "4A"
is useful also.

An inspection of Figure 2 yields the following qualitative conclusions.
First, cases 1 and 2 are relatively uninfluential. Second, case 4 is
influential for $(\beta, \sigma^2)$ and $\beta$, but not for $\sigma^2$ alone.
Finally, case 3 is influential for $(\beta, \sigma^2)$ and $\sigma^2$, but not
for $\beta$ alone. Notice that "3" falls just to the right of the vertical
line (17) at $\beta = \hat\beta = .92$ where $L = -\infty$.

Returning to the all-points-but-one-on-a-line problem, we now see that
$\mathrm{LD}_i(\beta \mid \sigma^2)$ will not identify case 3 to be the most
influential since "3" will be moved from $\sigma^2 = 0$ to the $g_1(\beta)$
curve prior to the computation of $\mathrm{LD}_i(\beta \mid \sigma^2)$. This
movement loses all information on changes in $\sigma^2$, but is essential if
we are to isolate changes in location as $\mathrm{LD}_i(\beta \mid \sigma^2)$
is designed to do.


3.2  Contour  Comparisons 

Further insights can be obtained by comparing the contours of the four
measures in the $(b_i, h_i)$ plane. The contours for
$\mathrm{LD}_i(\beta, \sigma^2)$, $\mathrm{LD}_i(\beta \mid \sigma^2)$ and
$\mathrm{DFFITS}_i^2$ are given in Figures 3-5, respectively. Recall that
$\mathrm{LD}_i(\sigma^2 \mid \beta) = \mathrm{LD}_i(\beta, \sigma^2)$ when
$h_i = 0$; thus the contours for $\mathrm{LD}_i(\sigma^2 \mid \beta)$ are
parallel to the x-axis and they intersect the y-axis at the same points as the
contours of $\mathrm{LD}_i(\beta, \sigma^2)$ in Figure 3.

According to Welsch (1982), $\mathrm{DFFITS}_i$ is designed to measure changes
in location and scale simultaneously. With this in mind, we first compare
Figures 3 and 5. The contours in these two figures are remarkably similar
when $b_i < h_i$; when this condition holds we can expect
$\mathrm{DFFITS}_i^2 \approx \mathrm{LD}_i(\beta, \sigma^2)$. When
$b_i > h_i$, the two sets of contours are quite different and
$\mathrm{LD}_i(\beta, \sigma^2)$ is considerably more sensitive to increases
in $b_i$. Evidently, $\mathrm{DFFITS}_i^2$ is not sufficiently sensitive to
changes in scale. Numerical illustrations of this insensitivity are easily
constructed. Suppose, for example, that $b_i = .99$ so that from (8)
$\hat\sigma_{(i)}^2 \approx .01\,\hat\sigma^2$. With $b_i$ fixed at .99,
$\mathrm{DFFITS}_i^2$ can be made arbitrarily small by letting $h_i \to 0$.
Under these same conditions, however,
$\mathrm{LD}_i(\beta, \sigma^2) \to \mathrm{LD}_i(\sigma^2 \mid \beta)$. This
example can be used to formulate a more realistic
all-points-but-one-nearly-on-a-line problem in which $\mathrm{DFFITS}_i$ may
fail to find the point that is far from the line.
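The numerical illustration can be made concrete with the closed forms (9), (10) and (11). In the sketch below $b_i = .99$ is taken from the text, while $n = 20$, $p = 2$ and the sequence of leverages are hypothetical values chosen only for illustration.

```python
import math

def dffits2(b, h, n, p):
    """DFFITS_i^2, equation (9)."""
    return (n - p - 1) * b / (1.0 - b) * h / (1.0 - h)

def ld_full(b, h, n):
    """LD_i(beta, sigma^2), equation (10)."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + b * (n - 1) / ((1.0 - b) * (1.0 - h)) - 1.0)

def ld_sigma2(b, n):
    """LD_i(sigma^2 | beta), equation (11)."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + (n * b - 1.0) / (1.0 - b))

n, p, b = 20, 2, 0.99     # n and p are hypothetical; b_i = .99 as in the text
rows = [(h, dffits2(b, h, n, p), ld_full(b, h, n))
        for h in (1e-1, 1e-2, 1e-4, 1e-6)]
limit = ld_sigma2(b, n)

for h, d2, ld in rows:
    print("h = %-8g DFFITS^2 = %-12.6g LD(beta, sigma^2) = %.6g" % (h, d2, ld))
print("LD(sigma^2 | beta) = %.6g" % limit)
```

As $h_i$ shrinks, the first column of results tends to zero while the second settles at the value of $\mathrm{LD}_i(\sigma^2 \mid \beta)$, which is large because the case is an extreme outlier.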

A variety of other useful insights can be obtained by comparing Figures 3-5.
For example, $\mathrm{LD}_i(\beta \mid \sigma^2)$ responds primarily to $h_i$
while $\mathrm{LD}_i(\sigma^2 \mid \beta)$ is independent of $h_i$. Clearly,
leverage is more important for changes in coefficients while outliers (as
reflected by $b_i$) are important for changes in scale. When examining
Figures 3-5 it should be remembered that only $\mathrm{DFFITS}_i^2$ and
$\mathrm{LD}_i(\beta, \sigma^2)$ are directly comparable since the other
measures concentrate on selected aspects of the problem.

Atkinson (1981) indicates a preference for measures like $\mathrm{DFFITS}_i$
since they emphasize outliers more than $D_i$. Relative to the likelihood
displacement, such emphasis is insufficient if both $\beta$ and $\sigma^2$ are
of interest and is oversufficient if interest centers on $\beta$ alone.
Generally, Figures 3-5 show that $\mathrm{DFFITS}_i^2$ lies between
$\mathrm{LD}_i(\beta, \sigma^2)$ and $\mathrm{LD}_i(\beta \mid \sigma^2)$ when
$b_i > h_i$.

Welsch (1982) favors yet another measure of influence that can be written

$$\frac{n-1}{1-h_i}\,\mathrm{DFFITS}_i^2 \qquad (18)$$

This measure is intended to reflect the influence of cases on location, scale
and the shape of the covariance matrix. From the above discussion it seems
clear that the shape information is coming at the substantial expense of
information on coefficients and scale. Perhaps it is unwise to expect so much
information from a single number.


4. DISCUSSION

Many of the initial developments in the area of influence assessment are
based on ad hoc reasoning, as often happens during the infancy of any new
methodology. For further progress and a deeper understanding of available
methodology, larger perspectives seem necessary. We have found the likelihood
displacement to be particularly well-suited for the study of influence,
although other reasonable frameworks are possible, of course. For example,
Johnson and Geisser (1983) adopt a predictivist view.

Within the likelihood framework, we conclude that
$\mathrm{LD}_i(\beta, \sigma^2)$ is the most useful one-number summary of
influence in the absence of more specific concerns. This conclusion follows
from two observations. First, $\mathrm{LD}_i(\beta \mid \sigma^2)$ and
$\mathrm{LD}_i(\sigma^2 \mid \beta)$ are bounded above by
$\mathrm{LD}_i(\beta, \sigma^2)$. Cases that are uninfluential for
$(\beta, \sigma^2)$ must therefore be uninfluential for $\beta$ and $\sigma^2$
considered separately. The specific concerns reflected by
$\mathrm{LD}_i(\beta \mid \sigma^2)$ and $\mathrm{LD}_i(\sigma^2 \mid \beta)$
need to be addressed only when $\mathrm{LD}_i(\beta, \sigma^2)$ is
sufficiently large. Second, $\mathrm{DFFITS}_i^2$ and related measures like
Atkinson's (1981, 1982) modified Cook statistic will be essentially equivalent
to $\mathrm{LD}_i(\beta, \sigma^2)$ when $h_i > b_i$; otherwise these measures
are not sufficiently sensitive to changes in scale.
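The first observation, that both partial displacements are bounded above by the full displacement, can be checked on a grid using the closed forms (7), (10) and (11). The sketch below is a numerical spot check under an arbitrary choice of $n = 20$ and grid spacing, not a proof; the function names are illustrative.

```python
import math

def ld_full(b, h, n):
    """LD_i(beta, sigma^2), equation (10)."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + b * (n - 1) / ((1.0 - b) * (1.0 - h)) - 1.0)

def ld_sigma2(b, n):
    """LD_i(sigma^2 | beta), equation (11)."""
    return (n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
            + (n * b - 1.0) / (1.0 - b))

def ld_beta(b, h, n):
    """LD_i(beta | sigma^2), equation (7)."""
    return n * math.log(b * h / (1.0 - h) + 1.0)

n = 20                                    # hypothetical sample size
grid = [0.02 * k for k in range(0, 49)]   # 0.00, 0.02, ..., 0.96

# Smallest slack in each bound over the (b_i, h_i) grid; both should be >= 0
# up to floating-point rounding.
worst_beta = min(ld_full(b, h, n) - ld_beta(b, h, n)
                 for b in grid for h in grid)
worst_sigma2 = min(ld_full(b, h, n) - ld_sigma2(b, n)
                   for b in grid for h in grid)
```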


Since coefficients are often a major concern in linear regression,
$\mathrm{LD}_i(\beta \mid \sigma^2)$ or, equivalently, $D_i$ can be added to
give a useful two-number summary of influence. If a subset $\beta_1$ of
$\beta^T = (\beta_1^T, \beta_2^T)$ is of special interest,
$\mathrm{LD}_i(\beta \mid \sigma^2)$ can be refined further by using the
general form given in (2).

Since the three likelihood displacements considered here depend only on $n$,
$b_i$ and $h_i$, other summaries might include various combinations or
transformations (e.g., to Studentized residuals) of these quantities. Such
mixed summaries require different scales for interpretation and are therefore
somewhat more difficult to comprehend than constant scale summaries. Of
course, $b_i$ and $h_i$ might be useful for purposes other than an assessment
of influence.

Finally, equation (12) shows one way to generalize DFFITS beyond linear
models.


APPENDIX 


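Derivation of Equations (10) and (11)


By definition,

$$\mathrm{LD}_i(\beta, \sigma^2) = 2[L(\hat\beta, \hat\sigma^2) - L(\hat\beta_{(i)}, \hat\sigma_{(i)}^2)]$$

where

$$L(\hat\beta, \hat\sigma^2) = -\frac{n}{2} \log \hat\sigma^2 - \frac{n}{2} - \frac{n}{2} \log 2\pi$$

$$L(\hat\beta_{(i)}, \hat\sigma_{(i)}^2) = -\frac{n}{2} \log \hat\sigma_{(i)}^2 - \frac{1}{2} \sum_{j=1}^{n} \frac{(y_j - x_j^T \hat\beta_{(i)})^2}{\hat\sigma_{(i)}^2} - \frac{n}{2} \log 2\pi$$

Since

$$\sum_{j=1}^{n} \frac{(y_j - x_j^T \hat\beta_{(i)})^2}{\hat\sigma_{(i)}^2} = \frac{(n-1)\hat\sigma_{(i)}^2 + e_{(i)}^2}{\hat\sigma_{(i)}^2}$$

where $e_{(i)} = y_i - x_i^T \hat\beta_{(i)} = e_i/(1-h_i)$, it follows that

$$\mathrm{LD}_i(\beta, \sigma^2) = n \log\left(\frac{\hat\sigma_{(i)}^2}{\hat\sigma^2}\right) + \frac{e_{(i)}^2}{\hat\sigma_{(i)}^2} - 1$$

Now, using (8),

$$\mathrm{LD}_i(\beta, \sigma^2) = n \log\left(\frac{n}{n-1}\right) + n \log(1-b_i) + \frac{e_{(i)}^2 (n-1)}{n \hat\sigma^2 (1-b_i)} - 1$$

$$= n \log\left(\frac{n}{n-1}\right) + n \log(1-b_i) + \frac{b_i(n-1)}{(1-b_i)(1-h_i)} - 1$$

as given by (10).

To derive (11), by definition,

$$\mathrm{LD}_i(\sigma^2 \mid \beta) = 2[L(\hat\beta, \hat\sigma^2) - L(g(\hat\sigma_{(i)}^2), \hat\sigma_{(i)}^2)]$$

Since the maximum likelihood estimator of $\beta$ does not depend on
$\sigma^2$, $g(\hat\sigma_{(i)}^2) = \hat\beta$ and thus

$$L(\hat\beta, \hat\sigma_{(i)}^2) = -\frac{n}{2} \log \hat\sigma_{(i)}^2 - \frac{n \hat\sigma^2}{2 \hat\sigma_{(i)}^2} - \frac{n}{2} \log 2\pi$$

Then, we obtain

$$\mathrm{LD}_i(\sigma^2 \mid \beta) = n \log\left(\frac{\hat\sigma_{(i)}^2}{\hat\sigma^2}\right) + n \left(\frac{\hat\sigma^2}{\hat\sigma_{(i)}^2} - 1\right)$$

Equation (11) now follows from this and equation (8).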
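The derivation can be cross-checked numerically by evaluating both displacements directly from the log likelihood and comparing them with the closed forms (10) and (11). The following is a minimal sketch, assuming regression through the origin ($p = 1$) with hypothetical data; the function and variable names are illustrative only.

```python
import math

# Hypothetical data for regression through the origin (p = 1).
x = [1.0, 2.0, 3.0, 4.0, 8.0]
y = [1.1, 1.9, 3.2, 3.7, 9.5]
n = len(x)

def loglik(beta, sigma2):
    """Full-data normal log likelihood L(beta, sigma^2; Z)."""
    rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    return (-0.5 * n * math.log(sigma2) - rss / (2.0 * sigma2)
            - 0.5 * n * math.log(2.0 * math.pi))

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
beta_hat = sxy / sxx
rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / n
peak = loglik(beta_hat, sigma2_hat)

checks = []
for i in range(n):
    # Deleted-case MLEs.
    beta_i = (sxy - x[i] * y[i]) / (sxx - x[i] ** 2)
    rss_i = sum((yj - beta_i * xj) ** 2
                for j, (xj, yj) in enumerate(zip(x, y)) if j != i)
    sigma2_i = rss_i / (n - 1)

    h = x[i] ** 2 / sxx
    b = (y[i] - beta_hat * x[i]) ** 2 / (rss * (1.0 - h))

    # Direct evaluation from the definitions.
    full_direct = 2.0 * (peak - loglik(beta_i, sigma2_i))
    scale_direct = 2.0 * (peak - loglik(beta_hat, sigma2_i))

    # Closed forms (10) and (11).
    common = n * math.log(n / (n - 1)) + n * math.log(1.0 - b)
    full_closed = common + b * (n - 1) / ((1.0 - b) * (1.0 - h)) - 1.0
    scale_closed = common + (n * b - 1.0) / (1.0 - b)

    checks.append((full_direct, full_closed, scale_direct, scale_closed))
```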